PSCI 2270 - Week 3
Department of Political Science, Vanderbilt University
September 17, 2024
Learning about population from sample
Descriptive statistics
Some math…
We are usually interested in making inferences about a group of units, i.e., a population
To do so, we collect multiple individual measurements
Once we aggregate, the formula becomes: estimate = estimand + noise + bias
We often cannot survey or measure outcomes among the whole set of units we are interested in \(\Rightarrow\) Target population
We then have to resort to a subset of units that we can reasonably collect data for \(\Rightarrow\) Sample
We collect the sample from the available list that ideally includes the whole population \(\Rightarrow\) Sampling frame
Reliability and validity are still a concern
In addition, sampling brings more sources of bias:
Simple random sampling: Every unit has an equal selection probability
e.g. random digit dialing (RDD):
…or random walk (?)
Literary Digest predicted elections using mail-in polls
Primary source of addresses: Automobile registrations, phone books, country club memberships
In 1936, sent out 10 million ballots, over 2.3 million returned
| Pollster | FDR’s Vote Share |
|---|---|
| Literary Digest | 43% |
| George Gallup | 56% |
| Actual Outcome | 62% |
Ballots skewed toward the wealthy (with cars, phones) \(\Rightarrow\) selection bias
| Pollster | Truman | Dewey | Thurmond | Wallace |
|---|---|---|---|---|
| Crossley | 45% | 50% | 2% | 3% |
| Gallup | 44% | 50% | 2% | 4% |
| Roper | 38% | 53% | 5% | 4% |
| Actual Outcome | 50% | 45% | 3% | 2% |
Quota sampling:
Most polls concluded ~2 weeks prior to the election \(\Rightarrow\) selection bias
Republicans easier to interview within quotas (phones, listed addresses, etc.) \(\Rightarrow\) non-response bias
Descriptive (summary) statistics are numerical summaries of those observations
Two salient features of a variable that we want to know:
\[ \color{#98971a}{\bar{x}} = \color{#d65d0e}{\frac{1}{n}} \color{#458588}{\sum_{i = 1}^{n} x_{i}} \]
What’s all this notation?
Applied to the mean:
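The formula can be sketched step by step in code (the data here are purely illustrative):

```python
# Compute the sample mean x̄ = (1/n) * Σ x_i term by term.
x = [2, 4, 6, 8, 10]   # illustrative observations x_1, ..., x_n

n = len(x)             # n: number of observations
total = sum(x)         # Σ x_i: the summation term
x_bar = total / n      # multiply by 1/n

print(x_bar)           # 6.0
```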
Median more robust to outliers:
Quantile (quartile, quintile, percentile, etc):
Interquartile range (IQR): a measure of variability
One definition of outliers: more than 1.5 × IQR above the upper quartile or below the lower quartile
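A sketch of the median, quartiles, IQR, and the 1.5 × IQR outlier rule, using Python's standard library (data are illustrative; `statistics.quantiles` uses its default "exclusive" method here):

```python
import statistics

# Illustrative data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

median = statistics.median(data)              # robust to the outlier
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1                                 # interquartile range

# 1.5 × IQR rule: flag points beyond the fences as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print(median)    # 5.5  (the mean, pulled up by 100, would be 14.5)
print(iqr)       # 5.5
print(outliers)  # [100]
```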
\[ \text{sd} = \color{#cc241d}{\sqrt{\color{#b16286}{\frac{1}{n - 1}} \color{#98971a}{\sum_{i = 1}^{n}} \color{#458588}{(}\color{#d65d0e}{x_i - \bar{x}}\color{#458588}{)^2} }} \]
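The formula can be followed piece by piece in code (illustrative data; the stdlib `statistics.stdev` uses the same \(n - 1\) denominator):

```python
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative data
n = len(x)
x_bar = sum(x) / n             # sample mean

# Deviations from the mean, squared, averaged with n - 1, then square-rooted.
squared_devs = [(xi - x_bar) ** 2 for xi in x]
sd = (sum(squared_devs) / (n - 1)) ** 0.5

print(round(sd, 4))                   # 2.1381
print(round(statistics.stdev(x), 4))  # library version matches: 2.1381
```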
Steps:
Learning about population from sample
Descriptive statistics
Probability:
Law of Large Numbers
Central Limit Theorem:
In real data, we will have a set of \(n\) observations of a variable: \(X_1\) , \(X_2\), … , \(X_n\)
Empirical analyses: summary of these \(n\) observations
Law of Large Numbers (LLN)
Let \(X_1\) , … , \(X_n\) be i.i.d. random variables with mean \(\mu\) and finite variance \(\sigma^2\). Then, \(\bar{X}_{n}\) converges to \(\mu\) as \(n\) gets large.
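A quick simulation of the LLN: fair coin flips have \(\mu = 0.5\), and the running sample mean settles toward 0.5 as \(n\) grows (the seed and sample sizes are arbitrary choices):

```python
import random

random.seed(42)

# Simulate i.i.d. fair coin flips (Bernoulli with μ = 0.5).
flips = [random.random() < 0.5 for _ in range(100_000)]

# The running mean gets closer to μ = 0.5 as n increases.
for n in (10, 1_000, 100_000):
    print(n, sum(flips[:n]) / n)
```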
The normal distribution is the classic “bell-shaped” curve.
Three key properties:
Central Limit Theorem (CLT)
Let \(X_1\) , … , \(X_n\) be a statistical sample from a population with mean \(\mu\) and variance \(\sigma^2\). Then, \(\bar{X}_n\) (sample mean) will be approximately distributed \(N ( \mu, \sigma^2 / n )\) as \(n\) goes to infinity.
Approximation is better as sample size goes up
Important result: We now know how far away \(\bar{X}_n\) can be from its mean!
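A simulation illustrating the CLT: even from a skewed distribution (Exponential(1), where \(\mu = 1\) and \(\sigma = 1\)), the sample means cluster around \(\mu\) with spread close to \(\sigma / \sqrt{n}\). The sample size and number of replications are arbitrary choices:

```python
import random
import statistics

random.seed(0)

# Draw 5,000 samples of size n = 50 from Exponential(1) (μ = 1, σ = 1)
# and record each sample's mean.
n = 50
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(n))
    for _ in range(5_000)
]

# CLT: means center on μ = 1 with spread ≈ σ/√n = 1/√50 ≈ 0.141.
print(statistics.mean(sample_means))   # close to 1
print(statistics.stdev(sample_means))  # close to 0.141
```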
We usually only have one sample, so we’ll only get one sample mean. So why do we care about LLN/CLT?
\[ SE = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}} \]
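In practice \(\sigma\) is unknown, so the formula is applied with the sample standard deviation \(s\) in its place. A sketch with made-up data:

```python
import statistics

# Estimated standard error of the mean: SE = s / √n,
# where s (the sample sd) stands in for the unknown σ.
sample = [23, 19, 31, 25, 22, 27, 30, 24, 26, 28]  # illustrative data
n = len(sample)
s = statistics.stdev(sample)
se = s / n ** 0.5

print(round(se, 3))  # 1.167
```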
Think about possible data strategies for answering question: Which factors affect election participation?
Applying CLT/LLN to get point estimates and estimates of uncertainty
Comparing group means and logic of causal inference